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Abstract 

This paper presents parallel implementations of 
two codes used in a combined CFD/Kirchhoff method- 
ology to predict the aerodynamics and aeroacoustics 
properties of helicopters. The rotorcraft Navier-Stokes 
code, TURNS, computes the aerodynamic flowfield 
near the helicopter blades and the Kirchhoff acous- 
tics code computes the noise in the far field, using the 
TURNS solution as input. The overall parallel strategy 
adds MPI message passing calls to the existing serial 
codes to allow for communication between processors. 
As a result, the total code modifications required for 
parallel execution are relatively small. The biggest bot- 
tleneck in running the TURNS code in parallel comes 
from the LU-SGS algorithm that solves the implicit 
system of equations. We use a new hybrid domain de- 
composition implementation of LU-SGS to obtain good 
parallel performance on the SP-2. TURNS demon- 
strates excellent parallel speedups for quasi-steady and 
unsteady three-dimensional calculations of a helicopter 
blade in forward flight. The execution rate attained 
by the code on 114 processors is six times faster than 
the same cases run on one processor of the Cray C-90. 
The parallel Kirchhoff code also shows excellent parallel 
speedups and fast execution rates. As a performance 
demonstration, unsteady acoustic pressures are com- 
puted at 1886 far-field observer locations for a sample 
acoustics problem. The calculation requires over two 
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hundred hours of CPU time on one C-90 processor but 
takes only a few hours on 80 processors of the SP2. The 
resultant far-field acoustic field is analyzed with state- 
of-the-art audio and video rendering of the propagating 
acoustic signals. 

Introduction 

Accurate numerical simulation of the aerodynamic 
and aeroacoustic properties of rotorcraft in hover and 
forward flight is a complex and challenging problem lu 
recent years, three-dimensional unsteady Euler/ Navier- 
Stokes methods [1-5] have been proven to accurately 
compute the aerodynamics and aeroacoustics proper- 
ties in the region near the blade. The TURNS (Tran 
sonic Unsteady Rotor Navier Stokes) code, developed 
by Srinivasan et al. [3,4] at NASA Ames, is one such 
example. TURNS has been demonstrated in a mini 
ber of studies to compute accurately the aerodynamic 
flowfield and near-field acoustics, both high-speed im- 
pulsive (HSI) and blade-vortex interaction (BVI) noise, 
of a helicopter rotor (without fuselage) in hover and for- 
ward flight in subsonic and transonic conditions [3-10]. 
The code has been applied in the framework of overset 
grids [6,7] and has been coupled with an unstructured 
grid adaptive method [8] for better resolution of flow 
features. Unlike more traditionally-used full potential 
methods that require use of a linear wake model to re- 
solve the tip vortex, TURNS captures the tip vortex 
and the entire vortical wake as part of the overall flow- 
field solution. 

In addition to the desire for high aerodynamic per- 
formance, modern helicopter designs also aim for low 
noise. The two types of noise that cause problems for 
helicopters are high speed impulsive (HSI), character- 
ized by a strong acoustic disturbance occurring over 
a short period of time, and blade-vortex interactions 
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(BVTs) which occur when the rotor blades interact 
with their vortical wake systems. Prediction of far- 
field noise is not trivial, since traditional CFD methods 
are too dissipative to carry the near-field pressure sig- 
nals into the far field. Recent progress in the prediction 
of far-field helicopter noise have come from combined 
CFD/Kirchhoff methods. A CFD method, usually a 
Full Potential or Euler/Navier-Stokes solver, is used 
close to the rotor blades to capture the near-field acous- 
tic nonlinearities. The solutions are then interpolated 
onto a surface that surrounds the CFD domain and a 
Kirchhoff integration is performed to carry the acoustic 
signal into the far field. 

The advantage of the hybrid CFD/Kirchhoff meth- 
ods is that they account for nonlinear acoustic propaga- 
tion modeled by CFD close to the rotor blades, whereas 
away from the blades, a Kirchhoff integral avoids dissi- 
pation and dispersion errors associated with any CFD 
solver. By comparison, acoustic analogy methods use 
CFD information only on the blade surface and do not 
account for the nonlinearities (e.g. supersonic regions 
for transonically moving blades) near the blade. 

The CFD/Kirchhoff approach was first demon- 
strated for rotorcraft applications in two-dimensional 
cases [11-13] and has since been applied in three- 
dimensions [14-19]. Kirchhoff surfaces have been ap- 
plied in a rotating reference frame (rotating with the 
helicopter blades) [14,15,19] as well as a nonrotating 
frame [9,18], with the Kirchhoff surface enveloping the 
entire rotating hub and blades. Strawn et al. [20] 
showed both approaches give identical results. The 
Kirchhoff approach has shown a higher level of accu- 
racy in predicting far field noise to more traditionally 
used methods. Baeder et al. [9] showed the Kirch- 
hoff method to be more accurate than acoustic anal- 
ogy without quadrapoles for hover HSI noise with tip 
Mach numbers higher than 0.7. Lyrintzis et al. [21] also 
found Kirchhoff’s method to be more accurate than the 
acoustic analogy results in the WOPWOP code [22] for 
unsteady HSI noise, and about the same for low-speed 
BVI noise. A thorough review on the use of Kirch- 
hoff’s formula in computational aeroacoustics is given 
by Lyrintzis [23]. 

Although the combined CFD/Kirchhoff approach 
has been proven to give accurate results, it is compu- 
tationally demanding. A three dimensional unsteady 
Navier-Stokes computation with TURNS, and a Kirch- 
hoff computation at a large number of far-field observer 
points, can require many hours of CPU time on super- 
computers of Cray-class. 

Parallel computation offers the potential for 
cheaper and faster computations. Several vendors 
have released machines that utilize commodity RISC 
processors that are the same as those used in high- 
end workstations (e.g. IBM SP2, Intel Paragon, SGI 
Power Challenge). These machines generally attain 
better price/performance than vector processors and 
have peak execution rates that are several times faster 
than vector parallel machines like the Cray C-90. Some 
companies are beginning to utilize parallel processing in 
the form of networked workstations, that sit idle during 


off-hours, to attain supercomputer performance [24]. 

Parallel processing has not, in general, been well- 
received by the engineering community [25] due mainly 
to the difficulties in writing parallel programs and their 
lack of portability, and the extensive algorithm modifi- 
cations that are required for efficient parallelism. The 
first problem has been circumvented somewhat with the 
advent of standardized parallel languages (e.g. Message 
Passing Interface - MPI, Parallel Virtual Language - 
PVM) that have made parallel codes portable and eas- 
ier to implement. In this paper, we introduce algorithm 
modifications that make both the TURNS code and an 
existing Kirchhoff code parallelizable. The modifica- 
tions to the existing code are small and relatively easy 
to implement. 

Finally, although both the CFD and Kirchhoff 
solvers are applied in this paper to helicopter problems, 
the parallelization strategies are not unique to this ap- 
plication and could readily be used for other problems. 


Computational Fluid Dynamics Model 

The structured-grid Euler/Navier-Stokes solver 
TURNS [3,4] is used to compute the aerodynamic flow- 
field close to the helicopter rotor. This CFD code solves 
the Navier-Stokes equations about rotating helicopter 
blades. Since viscous effects are small for the cases 
considered in this paper, the TURNS code is run in the 
Euler mode. All nonlinear effects on the acoustic prop- 
agation are accurately modeled within the framework 
of the Euler equations. 

Srinivasan [1] has modified the TURNS code to 
compute the aerodynamics of isolated BVFs. Recently, 
Baeder and Srinivasan [10] used this modified version of 
TURNS to compute the BVI noise of an isolated rotor 
blade interacting with an upstream-generated vortex. 
They showed good agreement with the experimental 
data of Caradonna et al. [26] for time-dependent sur- 
face pressures during the BVI event. This same ver- 
sion of the TURNS code is used for the calculations in 
this paper. However, an additional module has been 
added to evaluate and store the appropriate pressure 
information on the Kirchhoff surfaces. More detailed 
information about the TURNS algorithm is given in 
the remainder of this section. 

The governing equations for the TURNS code 
are the unsteady, compressible, three-dimensional thin 
layer Navier-Stokes equations. These equations are 
applied in conservation form in a generalized body- 
conforming curvilinear coordinate system 

d r Q -f- d^E + drjF + d^G = -^—3^5 (1) 

where r = *, £ = £(x,y,z,f), V = and 

C = C(z. 2/> z , 0- The coordinate system (x, y , z, t ) is at- 
tached to the blade. The vector of conserved quantities 
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is Q, and the inviscid flux vectors E } F, and G are 
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where H = (e+p) and U , V, and W, are the contravari- 
ant velocity components (e.g. U = £< +£r«+£y u+£*w)- 
The cartesian velocity components u y v, and w are de- 
fined in the x, y, and z directions, respectively. The 
quantities £ tl £ y , and G are the coordinate trans- 
formation metrics and J is the Jacobian of the trans- 
formation. The pressure p is related to the conserved 
quantities through the perfect gas equation of state 

p = (7 - 1) { e - |(« 2 + y2 + ™ 2 )} ( 3 ) 


restricts the size of the time step. The two-factor LU- 
SGS method has enhanced stability along with a reduc- 
tion in factorization error (order At 2 ) that make it an 
attractive alternative. 

LU-SGS resembles a typical LU factorization 
scheme with diagonal preconditioning to increase ro- 
bustness. The scalar diagonal terms are obtained by 
use of approximate Jacobians, avoiding costly matrix 
inversions. First, the Jacobian terms A , 5, and C in 
Eq. (4) are split into “ + ” and “ - ” parts, with posi- 
tive parts constituting only the positive eigenvalues and 
negative parts constituting only the negative eigenval- 
ues. The positive matrix is backward differenced and 
the negative matrix is forward differenced, as follows 

= 6^A+ +6+ A~ (6) 

= Af-A +_ 1 + A- + 1 -A~ 

This splitting ensures diagonal dominance. The Jaco- 
bians are then approximated using the following spec- 
tral approximation 

A ± = ± paI) ± £PaI CO 


The viscous flux vector 5 is incorporated in the code 
but the calculations given in this paper are all inviscid 
(i.e. c = 0 in Eq. (1)) so the viscous terms are not 
described here. Details can be found in [3]. 

The inviscid fluxes are evaluated using Roe’s up- 
wind differencing [27] in all three directions. The use of 
upwinding obviates the need for user-specified artificial 
dissipation and improves the shock capturing in tran- 
sonic flowfields. Third order accuracy is obtained using 
van Leer’s MUSCL approach [28] and flux limiters are 
applied so the scheme is Total Variation Diminishing 
(TVD). 

The final Euler discretized form of Eq. (1) in im- 
plicit delta form is 

[I + h {6s A n + 6 rj B n + 6 C C n )] A Q n = -hRHS n (4) 
where 

RHS " = dsE n + drjF n + <9 c G n (5) 

/ is the identity matrix, h is the timestep and A Q n = 
Q n+1 — Q n . The 5x5 matrices A , B and C are the Jaco- 
bians of the flux vectors with respect to the conserved 
quantities (e.g. A = 


Implicit LU-SGS Operator 

TURNS uses the two-factor LU-SGS (Lower- 
Upper Symmetric Gauss Seidel) algorithm of Yoon and 
Jameson [29] for the implicit time step. The LU-SGS al- 
gorithm has been used in a number of well-known CFD 
codes (e.g. INS3D [30], OVERFLOW [31]) primarily 
for it’s stability properties. Classic implicit methods 
such as Beam- Warming approximate factorization have 
a large factorization error (of order A/ 3 ) which further 


where p A is the spectral radius of A (in the £ direction). 

p A = max [|A^|] = \U\ + a |V£| (8) 

£ is some small value (e.g. .001), and is defined as 

± f 1 if±(U±a |V<£|)>0 (9) 

s a S 0 otherwise v ' 

The same procedure is used in the y and ( directions 
to form the B and C terms. 

Substituting these approximate Jacobians into the 
implicit equation (4), the equation takes the form 

LD- l UAQ n = -hRHS" (10) 

where 

D = I + h (pA + PB + Pc)j'k,l 
L = D-h (Af_ y + + Cti) 

u = D + h (A- +l + b; + 1 + c,~ +l ) ( 11 ) 

D is a diagonal matrix, and the two-step LU decompo- 
sition can be performed by 

LAQ* = -hRHSA 

UAQ n = DAQ * (12) 

which can also be written as the following algorithm 
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Algorithm Is LU-SGS 


Do jf, k , / = 


!>•*•? Jmax j A max , L n 


= a- 1 * 

End Do 


+B*_ l 6Qr h _ l +C*_ l 6QU) 


Do j y kj / — Jmax j Kmax i 
SQ jikJ = 6Q* jtkI -D- l h 
End Do 


L>max )*••}! 

*K + i+C|+i«Wh 


Time Stepping 

Time stepping in TURNS is done using the implicit 
LU-SGS operator. The LU-SGS scheme introduces a 
factorization error of 0(At 2 ) and is consequently not 
time-accurate. A time-accurate solution can be deter- 
mined using inner iterations to relax the factorization 
error. Using the solution at time level n, the initial con- 
dition is set Q n +LO = Q n and the following equation is 
solved using LU-SGS 

[LD~ 1 U] n+1,m Aq = 

( n n + l,m _ Qn \ 

— — + RHS*+ l ' m J (13) 

where RHS is the same as Eq. (5) and 

Aq = Q"+Um+i __ Qn+i t m ( 14 ) 

Here, n refers to the time level and m to the itera- 
tion level. In our tests, three inner iterations were 
used at each timestep. Upon completion of the in- 
ner iterations, the solution at the next time level is 

Q"+i — Qn+ l,m max 


TURNS Parallel Implementation 

For parallel implementation of TURNS, we wanted 
to use an approach that would require relatively minor 
changes to the existing code, since the code is rela- 
tively long (6000 lines), and we wanted the resulting 
code to be portable. For these reasons, a MIMD (i.e. 
requiring message passing) approach is chosen over a 
SIMD (Single Instruction Single Data) or data-parallel 
approach. Message passing codes are generally more 
portable to different parallel architectures (e.g. from 
massively parallel supercomputers to workstation clus- 
ters) and the message passing subroutine calls can be 
added to the existing Fortran 77 code, as opposed to 
rewriting the entire code in a High Performance Fortran 
type language (e.g. CMFortran). To ensure portability, 
a set of generic message passing subroutines is used. 
With this protocol, the specific message passing com- 
mands can be altered in one line of the code rather 
than throughout, making conversion to different mes- 
sage passing languages, such as PVM (Parallel Virtual 
Machine), a relatively short procedure. 
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Figure 1: Partitioning the three-dimensional domain on 
a two-dimensional array of processors. 


A domain decomposition strategy that preserves 
the original construct of the code is used for paral- 
lel implementation. The domain is divided in the 
wraparound and spanwise directions into subdomains. 
The normal direction is left intact, so the entire flow- 
field domain is layed out on a two-dimensional array of 
processors, as demonstrated in Fig. 1. Data is commu- 
nicated between processors, and a single layer of ghost- 
cells is used to store this communicated data. Although 
this approach leads to non-square subdomains, and op- 
timal parallelism is achieved with square subdomains, 
the normal direction is left undivided to prevent a load 
imbalance from occurring during the application of the 
boundary conditions at the C-plane. An averaging of 
data is performed between nodes lying on either side of 
the plane that requires communication of data across 
this plane. If the normal direction was divided, all pro- 
cessors not holding data on the C-plane would sit idle 
while the data is communicated. In the present imple- 
mentation, all processors participate in this communi- 
cation, eliminating the load imbalance. 

There are essentially three main portions of the 
solution algorithm in TURNS. The first is the formation 
of RHS . The communication required for this step is 
trivial. After the flux vectors are determined using the 
MUSCL routine, they are communicated and stored in 
the ghost layer. Then, Roe-differencing is applied. This 
communication step could be avoided by using a ghost 
layer of two cells, but the present approach was easier 
to implement into the existing code and constitutes a 
small percentage of the overall communication. The 


4 











second portion of the solution algorithm is application 
of the boundary conditions. This can be done local to 
each processor, with exception to the averaging of data 
across the C-plane, described above. The third part 
of the algorithm is the implicit solve using LU-SGS. 
This portion of the code is not trivial to implement in 
parallel. 
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Figure 2: Domain sweeping strategy used by LU-SGS 
algorithm. Can vectorize on hyperplanes where j+k+l 
— const . 


In the LU-SGS algorithm (Algorithm 1), the first 
step sweeps in the positive direction updating bQ* . 
The second step then computes 6Q by sweeping back 
through the domain in the opposite direction. This al- 
gorithm can be vectorized using a hyperplane approach, 
as outlined in ref. [32]. Vectorization is done across hy- 
perplanes in which j+k+l=const , as outlined in Fig. 2. 
While the hyperplane approach leads to good vector 
execution rates, it is difficult to parallelize for two rea- 
sons; 1) the size of the hyperplanes vary throughout the 
grid, leading to load balancing problems, and 2) there 
is a recursion between the planes, leading to a large 
amount of communication. 

Parallelization of the LU-SGS algorithm has been 
addressed by other researchers. Barszcz et al. [34] im- 
plemented the LU-SSOR algorithm, which is similar to 
LU-SGS, on a parallel machine by restructuring the 
data-layout using a skew-hyperplane approach. Al- 
though they were able to extract good parallelism with 
this approach the data-layout is complex and the re- 
structuring of data on the left hand side in turn causes 
the right hand side layout to be skewed and extra com- 
munication is required. Several researchers have pro- 
posed modifications of the LU-SGS algorithm to make 
it more parallelizable. Candler and Wright [33-35] have 
investigated a modification called Data-Parallel LU Re- 
laxation (DP-LUR), which has shown excellent results 
in a data-parallel environment. Wong et al. [36] have 
investigated a domain decomposition implementation 
of LU-SGS. For two-dimensional steady state react- 
ing flow problems, they found that, while the conver- 
gence rate of the operator is reduced with the domain 
breakup, the affect is relatively weak (e.g. with 64 sub- 
domains, the number of iterations increases by less than 
20%). 

In refs. [37,38], the DP-LUR algorithm is imple- 
mented in place of LU-SGS in TURNS and tested for 
three-dimensional rotorcraft calculations on the Think- 
ing Machines CM-5 using a message- passing implemen- 
tation. The DP-LUR method showed excellent paral- 
lelism with a low percentage of communication but the 
method required about three times the work of LU-SGS 


to maintain the same convergence rate. 
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Figure 3: Jacobi Communication Strategy of the Hy- 
brid algorithm. Load balanced parallelism with nearest 
neighbor communication. 

We propose in ref. [37] a modification of DP-LUR 
that was found to be more efficient with the message 
passing approach. The original DP-LUR algorithm 
uses Jacobi sweeps in place of the Gauss-Seidel sweeps 
used in LU-SGS. This has the benefit that all com- 
putations can be done locally to each processor and 
very little communication is required. However, be- 
cause Jacobi has a slower convergence rate than Gauss- 
Seidel, multiple sweeps of the domain (e.g. 3-6) are 
performed to maintain convergence, and the algorithm 
consequently requires more work. In our modified ap- 
proach, the original Gauss-Seidel sweeps in LU-SGS are 
performed locally on each processor while retaining the 
Jacobi communication strategy used in DP-LUR for 
communications. Essentially, the modified algorithm 
entails using Symmetric Gauss-Seidel sweeps for on- 
processor computations and a Jacobi method for inter- 
processor communications. Figure 3 describes this idea. 

The multiple sweep idea is retained from DP-LUR 
to improve robustness for cases where the breakup of 
the domain causes convergence difficulties. Because 
this modified approach retains properties of both DP- 
LUR and LU-SGS, we refer to it as a Hybrid parallel 
implementation of LU-SGS. 


Algorithm 2: Hybrid LU-SGS 


6Q ( °l, = D~ l hRHS n 

For i —■ 1, . . . , ist uccp Do 

Communicate border SQ^\ data 
to neighboring processors. 


= tQjrf 


Perform LU-SGS sweeps (Algorithm 1) locally on 
each processor, computing b>Q^ kl 


End Do 


6Qi,kj = ^Q%T r) 


The Hybrid algorithm is easily implemented in the 
domain decomposition strategy. Since original LU-SGS 
is used for on-processor computations, little code modi- 
fication is required of the LU-SGS algorithm already in 
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TURNS. Message passing is used to communicate the 
border data amongst neighboring processors, and then 
the computations proceed locally. On a single proces- 
sor (with 1 sweep), the Hybrid method is identical to 
the original LU-SGS algorithm. 


Kirchhoff Integration Method 

The CFD/Kirchhoff method consists of a CFD so- 
lution, which is done in this case using TURNS, near 
the rotor blades followed by a Kirchhoff integration to 
propagate the acoustic signals to the far field. Since 
viscous effects are small for the cases considered in this 
paper, the TURNS code is run in the Euler mode. All 
nonlinear effects on the acoustic propagation are ac- 
curately modeled within the framework of the Euler 
equations. 

The Kirchhoff formulation from Farassat and My- 
ers [39] is used to evaluate the acoustic pressure, P, at 
a fixed observer location, z, and observer time, t. In 
the general case, the Kirchhoff surface, S, can deform 
and can also have arbitrary motion. The formula can 
be written as: 
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(16) 


(17) 


The Kirchhoff surface is assumed to be moving with 
Mach number M . The distance between a point on the 
Kirchhoff surface and the observer is given by | rj. M n 

and Af r are the components of M along the local surface 
normal, n, and the radiation direction, r. M t is the 
Mach number tangent to the Kirchhoff surface, P n is 
the derivative of P along the surface normal, and V 2 P 
is the gradient of the pressure on the Kirchhoff surface. 
The freestream speed of sound is assumed uniform at 
aoo, and 6 is the angle between n and r. The dot over 
any quantity denotes its partial derivative with respect 
to time, t. The terms h r and tim are defined as n * r 
and n • A?, respectively. The simplified form for E 2 in 
Eq. (17) is taken from Myers and Hausmann [40]. 

Figure 4 shows a representative Kirchhoff surface 
in the rotating reference frame. The surface moves with 
the rotor blade and coincides exactly with a portion of 
the CFD mesh. The pressure and its spatial and tempo- 
ral derivatives on the Kirchhoff surface are computed by 


the TURNS code and stored at every azimuthal degree 
around the rotor disk. The pressure gradients are com- 
puted using the same coordinate transformation met- 
rics that are used to evaluate the fluxes in the flow 
solver. The chain rule is used to convert the pressure 
gradients from computational space to inertial space for 
the Kirchhoff integration. 



Figure 4: The rotating Kirchhoff surface is attached to 
the helicopter blade. 

The Farassat and Myers [39] Kirchhoff formulation 
in Eqs. (15-17) allows both rotation and translation of 
the Kirchhoff surface. Note that the entire integral in 
Eq. (15) must be evaluated at the time of emission for 
the acoustic signal, r, which is sometimes called the 
retarded time. The calculation of the proper retarded 
time values for each point on the surface is the most 
difficult step in the evaluation of the Kirchhoff integral. 
Extensive details are given in Ref. [20] . 

One major problem with this rotating-surface 
Kirchhoff method is that Eq. (14) requires the surface 
to move subsonically. This requirement places a restric- 
tion on the outer radial location for the rotating surface. 
If it is too far from the rotor blade tip, it will have su- 
personic motion, which is not allowed. On the other 
hand, if the surface is too close, the acoustic solution 
may be inaccurate. To overcome this problem, a second 
implementation uses a nonrotating Kirchhoff surface 
that completely surrounds the spinning rotor blades. 
Once the rotational motion is removed from the Kirch- 
hoff surface, its translational motion is always subsonic. 
However, one disadvantage of this formulation is that 
it requires interpolations from the rotating-grid CFD 
solution onto the nonrotating Kirchhoff surface. 

Reference [20] presents results for both types of 
Kirchhoff surfaces. Excellent agreement with experi- 
mental data was obtained for both HSI and BVI noise 
with each Kirchhoff method. Figure 5 shows one of 
the comparisons from that paper for HSI noise from 
the Army’s AH-1 helicopter with the OLS rotor blades. 
Computed far-field noise shows excellent agreement 
with the experimental data which is taken from Schmitz 
et al. [41]. 


Kirchhoff Parallel Implementation 

Implementation of the rotating and nonrotating 
Kirchhoff integrations on parallel computers is straight- 
forward due to the regular nature of the computations. 
The Kirchhoff surface is divided into patches and each 
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Figure 5: Comparison of acoustic pressures with experimental data at four different microphone locations for an 
AH-1 blade with M at = 0.837. All microphones are in the plane of the rotor. 


patch is assigned to a processor. Virtually no commu- 
nication is needed among processors except for a final 
summation of the integral contributions from each sur- 
face patch. In this paper, parallel results are presented 
only for the rotating Kirchhoff code. Work continues 
on implementing the nonrotating Kirchhoff code and 
results will be presented in reference [42]. 

The Kirchhoff integration is implemented using 
a MIMD message passing paradigm and is portable 
to any distributed-memory architecture that supports 
FORTRAN 90 and the MPI library. Details of the im- 
plementation are as follows. The Kirchhoff surface and 
associated pressure files are stored as two dimensional 
arrays of size (imax , jmax) } which must be partitioned 
and distributed among the processors. A “host” pro- 
cessor is designated to read and scatter the data to the 
working set, P } of processors. Each processor receives a 
contiguous one-dimensional strip of ( imax x jmax)/P 
data elements, plus an additional overlapped section of 
imax elements. This allows local, communication-free, 
nearest-neighbor computations. Note that the data 
could have also been distributed in two-dimensional 
blocks of size (imax / P, jmax / P ) , thereby reducing the 


space required for the overlapped elements while ad- 
versely affecting cache performance and complicating 
the local index calculations. Due to the relatively small 
size of the rows, the former less complex approach was 
taken. 

Results 

Both TURNS and the Kirchhoff code are imple- 
mented on the 160 node IBM SP2 multiprocessor lo- 
cated at NASA Ames. The standard MPI (Message 
Passing Interface) library is used for message passing 
in the codes. The TURNS source code is written in 
Fortran 77 while the Kirchhoff code is in Fortran 90. 
The codes are combined to compute the near-field CFD 
solution and far-field acoustic patterns for a nonlift- 
ing symmetric OLS blade rotating with advancing tip 
Mach number M a t = 0.837 (tip Mach number 0.665 and 
freestream Mach number 0.172). The OLS blade has a 
sectional airfoil thickness to chord ratio of 9.71% and is 
a 1/7 scale model of the main rotor for the Army's AH- 
1 helicopter. The following sections present results of 
the CFD solution using TURNS and the far-field noise 
prediction using the Kirchhoff method. 
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Figure 6: Upper half of the 135 x 50 x 35 C-H type grid 
used for OLS airfoil calculations on the CM-5. 


TURNS Results 

The TURNS code is run in Euler mode, using a 
135 x 50 x 35 C-H type grid, with the domain extending 
eight chords in all directions. The upper half of the grid 
is shown in Fig. 6. 

Before an unsteady computation can be per- 
formed, a quasi-steady starting solution must first be 
computed. In the quasi-steady case, the blade is 
held fixed at zero degrees azimuth while the tip and 
freesteam conditions are the set to that for the cor- 
responding unsteady solution. A spatially varying 
timestep, described in [33], is used to generate the 
quasi-steady solution. Since the solution is not time- 
accurate, the quasi-steady case provides a convenient 
way to quantify the convergence qualities of the Hybrid 
modification of LU-SGS on different processor parti- 
tions. 

TURNS is run for the quasi-steady case on 1, 4, 8, 
19, 57, and 114 processors. For the 4-8 processor cases, 
the wraparound direction in the grid is left intact while 
the spanwise direction is subdivided into 4 and 8 subdo- 
mains, respectively. For the 19, 57, and 114 processor 
cases, the wraparound direction is subdivided into a 19 
subdomains, and the spanwise direction is subdivided 
into 3 and 6 subdomains, respectively. Due to the spe- 
cific grid used, we were unable to use a more variable 
distribution of processors in the wraparound direction. 
The only factors of the 133 points in the wraparound 
direction are 7 and 19. With 7 subdomains, a pro- 
cessor boundary is placed at the trailing edge of the 
blade, which results in adverse convergence difficulties. 
Thus, we were forced to use only 19 subdomains in the 
wraparound direction. Use of other meshes in future 
calculations should allow more versatility to the choice 
of processor subdomains. 

The convergence criteria used for the quasi-steady 
case is the global L2 density norm is reduced by 
three orders of magnitude to 10“ 8 . The iterations are 
stopped when this criteria is reached. The number of it- 
erations required to meet this convergence criteria, the 
time per iteration and parallel speedup of the time per 
iteration, the percentage of time spent in communica- 
tion, and total time statistics are given for this quasi- 
steady case in Table 1. The results presented use 1 in- 


ner sweep in the hybrid LU-SGS algorithm. These were 
found to give the best CPU time results. In ref. [37], 
results of these same cases run on the Thinking Ma- 
chines CM-5 are presented. The results from the CM-5 
also indicated that 1 sweep gives the best overall CPU 
time. 

The convergence characteristics of TURNS for this 
quasi-steady case are shown in plots of the global den- 
sity L2 norm and the maximum density vs. number of 
iterations in Figs. 7 and 8, respectively. The behavior 
of the maximum residual, after 500 iterations, is essen- 
tially the same for all processor partitions tested. Some 
difference is observed in the behavior of the L2 density 
norm, however. With 1, 4, and 8 processors, the con- 
vergence curves are nearly identical. With 19, 57, and 
114 processors, the convergence curves are also nearly 
identical, but are worse than that of 1 processor. On 1 
processor, the convergence rate of the density norm is 
relatively constant until it reaches 3 x 10“ 8 , at which 
time there is a leveling off and the method shows a new, 
slower convergence rate. Similar behavior is shown with 
19 processors, but the point at which the leveling off oc- 
curs is at about 1 x 10" 7 . As a result, approximately 
30% more iterations are required to achieve the conver- 
gence criteria for the 19, 57, and 114 processor cases. 

It is unknown to us why the density norm con- 
vergence phenomenon occurs in this way, but it does 
illustrate the difference in subdividing domains in the 
wraparound and spanwise directions. With the 1-8 pro- 
cessor cases and and the 19-114 processor cases, where 
the number of subdomains in the wraparound direction 
is kept constant and the number in the spanwise direc- 
tion is varied, there is little difference in convergence. 
There is, however, substantial difference between the 1 
and 19 processor cases, where the spanwise direction 
is left intact while the wraparound direction is subdi- 
vided. This makes physical sense since the flow vari- 
ables have larger gradients in the wraparound direc- 
tion during convergence, so subdividing in that direc- 
tion should cause a reduction in convergence. 

On 1 processor of the SP2, TURNS was found to 
have an execution rate of 21.5 M flops/second. This 
is considerably lower than the 100 Mflops/second we 
were able to achieve with basic test codes, indicating 
TURNS is not optimized for the RISC chip architecture 
used on the nodes of the SP2. We expect this is due 
to portions of unstreamed data in certain parts of the 
code which cause cache misses. To further support this 
argument, superlinear parallel speedups are exhibited 
by the code, which is usually indicative of cache misses. 
Due the the length of the code, we have not yet made 
substantial efforts to restructure inefficient portions of 
the code but will look into this further in the future. 
It should be noted, however, that TURNS is a long 
and complex code; Despite modifications to improve 
vectorization for execution on the Cray C-90, the code 
still runs at only one quarter the peak speed, on one 
processor. Therefore, it should be expected that the 
code will not run at the peak rate on the RISC processor 
either. 

Despite the poor single processor performance, 
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Residual 2-Norm 


Table 1: Table 1 - Timing Results on the IBM SP2 for quasi-steady inviscid calculation with TURNS. Nonlifting 
OLS blade with M at = 0.837, 135 x 50 x 35 mesh, global density norm reduced to 10“ 8 . 


Processors 

# Wraparound 

#Spanwise 

Iterations 

Time/It 

Speedup 

% Comm. 

Tot. Time 

1 

1 

i 


15.19 sec 


Wo 

537.5 min 

4 

1 

4 

2130 

3.92 sec 

3.9 

1.9% 

139.2 min 

8 

i 

8 

2165 

2.08 sec 

7.3 

4.0% 

75.1 min 

19 

19 

1 

2769 

0.74 sec 

20.6 

12.3% 

34.2 min 

57 

19 

3 

2771 

0.26 sec 

59.1 

16.4% 

12.0 min 

114 

19 

6 

2783 

0.14 sec 

107.8 

20.0% 

6.5 min 



Figure 7: Convergence of global density L2 norm for 
quasi-steady problem with different processor parti- 
tions of the SP2. 


the code exhibits good parallel speedups and has an 
overall fast execution rate. On 114 processors, the 
time/iteration of the code is about six times faster than 
the performance attained on one processor of the Cray 
C-90. Although more iterations are required on the SP- 
2 for this case, the entire calculation is done in about 
6 minutes. By comparison, this calculation requires 
about 30 minutes on the Cray C-90. 

Using the quasi-steady solution as a starting so- 
lution, a time-accurate unsteady flow calculation for 
one complete revolution of the AH-1 blade is run with 
TURNS. The same solution mesh is used to start the 
unsteady calculation, but the mesh is set to rotate with 
the blade. A timestep corresponding to 0.25 deg az- 
imuth/ timestep is used, so a complete revolution is 
done in 1440 timesteps. Three inner relaxation iter- 
ations are performed in each timestep. 

Table 2 shows the time/timestep, parallel speedup, 
percentage of communication, and total times for the 
unsteady case run on the same processor partitions used 
in the quasi-steady case. A plot of the global density 
residual norm vs. time for the unsteady case is shown 
in Fig. 9 and the parallel speedup exhibited by the code 
on different processor partitions of the SP2 for this case 



Figure 8: Convergence of maximum global density for 
quasi-steady problem with different processor parti- 
tions of the SP2. 

is shown in Fig. 10. 

The time/timestep for the unsteady case is approx- 
imately three times that for the quasi-steady run, due 
to the use of 3 subiterations at each timestep. The 
percentage communication and parallel speedups are 
nearly the same as in the quasi-steady run. The do- 
main breakup has a similar effect on the global density 
residual for the unsteady case as it did for the quasi- 
steady case; The convergence on 4 and 8 processors is 
essentially the same as 1 processor, while on 19-114 
processors, there is a noticeable reduction. We antici- 
pate that using more sweeps inside the Hybrid method 
will reduce this effect and we are investigating this fur- 
ther. 

The benefit of parallel computation is apparent 
from the overall solution time. On one processor of 
the Cray C-90, this calculation required a little over 
an hour. On 114 processors of the SP2, the calcula- 
tion time is only 10 minutes, a reduction by a factor of 
about 6. 

Kirchhoff Code Results 

The computation time for a single evaluation of a 
far-field observer pressure on one processor of the SP2 
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Table 2: Table 2 - Timing Results for a 1 revolution unsteady inviscid calculation with TURNS on the SP2. OLS 
blade, M at = 0.837, 3 inner iterations per timestep, 1440 timesteps with each timestep = 0.25 deg azimuth. 


Processors 

#Wraparound 

#Spanwise 

Time/Timestep 

Speedup 

% Comm. 

Tot. Time 

1 

1 

i 

45.48 sec 


— Wo — 

1092 min 

4 

1 

4 

11.72 sec 

3.9 

2.0% 

281.3 min 

8 

i 

8 

6.26 sec 

7.3 

3.7% 

150.2 min 

19 

19 

1 

2.16 sec 

21.1 

9.8% 

51.8 min 

57 

19 

3 

0.74 sec 

61.4 

13.1% 

17.8 min 

114 

19 

6 

0.43 sec 

106.5 

20.8% 

10.3 min 



Figure 9: Global density norm history for unsteady cal- 
culation on the SP2. One revolution (1440 timesteps 
at 0.25 deg azimuth/timestep) of AH-1 blade with 
M a t = 0.837. 

is about 1.6 seconds. This compares very favorably 
with a CPU requirement of 1.5 seconds on a single pro- 
cessor of a Cray C-90 computer. The reason for this 
excellent performance of the SP2 relative to the C-90 
is that each integration point on the Kirchhoff surface 
requires an iterative solution for the nonlinear retarded 
time equation. This retarded time calculation does not 
vectorize on the C-90 and the performance is therefore 
poor. The SP2 currently does not have a vector pro- 
cessing capability; hence, the solution to the retarded 
time equation does not adversely affect performance. 
Since no communication is required until a final global 
summation, and the I/O is independent of the number 
of processors, the actual run time on the SP2 scales by 
over 99% as the number of processors increases. As a 
result, with 80 processors on the SP2, we can compute 
a periodic time history of 360 pressure evaluations for 
1886 observers in few hours. On one processor of the 
Cray C90, this calculation would require over 200 hours 
of CPU time. 

Visual Postprocessing 
Figure 11 shows a sample grid of far-field observer 


Parallel Speedup 



Figure 10: Parallel speedups on SP2 for unsteady cal- 
culation. 

locations for the OLS rotor blade. These grid points 
are located in the plane of the rotor between 4 and 
7 blade radii from the rotor hub. Overall, this grid 
contains a total of 1886 observer points. Time histories 
of acoustic pressure were computed for each of these 
observer locations using the parallel KirchhofT code on 
the SP2. These unsteady results are then animated and 
visualized on an SGI graphics workstation. 



Figure 11: Far-field grid of observer points for the 
Kirchhoff acoustics simulation. 

Figure 12 shows four snapshots from an animation 
of the resulting acoustic field. The acoustic pressure 
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Figure 12: Contours of computed far-field acoustic pressure for several different blade azimuth positions. The 
pressure contours are scaled by the radius from the rotor hub. 


fields are scaled by the distance from the rotor hub, 
which helps to show the directivity of the noise in the 
far field. At each blade azimuthal location, the maxi- 
mum scaled acoustic amplitude contours contours show 
that the maximum directivity of the acoustic signal is 
along the direction of flight, directly ahead of the rotor 
blade and slightly to the advancing side. This is consis- 
tent with data obtained from windtunnel experiments 
and flight tests for similar conditions. 

The animated results give a clear picture of the un- 
steady propagation of acoustic signals in the far field. 
The advantage to this approach is that the entire field 
can be viewed at once. This conveys much more in- 
formation to the viewer than what can be seen from a 
handful of far-field experimental microphones. The im- 
portance of this additional information is even greater 
for BVI noise where the far-field acoustic propagation 
is much more complex than that seen in Fig. 12. 

Audio Postprocessing 

Time-dependent CFD/Kirchhoff rotorcraft acous- 
tics simulations are well-suited for audio playback. The 
CFD/Kirchhoff integration provides pressure data at 
discrete locations in the far-field such as those shown 
in Fig. 11. Audio playback is obtained by using the 
Stereophonic Acoustics Software Library that has been 


recently developed at NASA Ames Research Center 
This software package allows the user to interactively 
select individual grid points where pressure data can be 
played over headphones or stereo speakers connected to 
an SGI Indigo computer. The digital pressures for tie- 
entire field are scanned for maximum and minimum am 
plitudes and then scaled for 16-bit stereo sound I In- 
user can add a slight phase shift between the signals for 
each ear. This phase shift simulates the inter-aural time 
delay that conveys spatialized acoustic perceptions to 
the human brain. 

The sample results in Fig. 12 have been processed 
for the audio playback described above. Although we 
cannot effectively convey the audio results in this paper, 
the consensus opinion of those who have heard it is that 
the simulation is very realistic. A video will be shown 
at the presentation demonstrating this. 

Concluding Remarks 

An efficient parallelization of two codes used in 
a combined CFD/Kirchhoff methodology is presented. 
An Euler/ Navier Stokes rotorcraft CFD code called 
TURNS and an existing rotating Kirchhoff code are 
both implemented on the 160 node IBM SP2 multi- 
processor at NASA Ames. The two codes are used to- 
gether to predict both the near-field aerodynamics and 
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far-field aeroacoustic characteristics of an AH-1 blade 
rotating with M at = 0.837. TURNS is used to com- 
pute the inviscid flowfield in the region near the blades 
and the Kirchhoff code, using the CFD results, com- 
putes the far-field noise characteristics. Some algorithm 
changes are required to allow the codes to have efficient 
parallel execution, but a domain decomposition imple- 
mentation is used that requires relatively small mod- 
ifications to the existing codes. Communications are 
performed by adding standard MPI message passing 
calls to to the Fortran code, and techniques are used to 
ensure easy portability. 

The main portion of TURNS that presents diffi- 
culty for parallel implementation is the LU-SGS al- 
gorithm, used for the implicit timestep. A modified 
Hybrid domain decomposition implementation of LU- 
SGS is utilized. Although some reduction in conver- 
gence is observed with this implicit algorithm, the re- 
duction is small and the method shows excellent paral- 
lel speedups. Three dimensional quasi-steady and un- 
steady calculations are performed with TURNS using 
up to 114 processors, and the time/iteration for these 
cases is reduced by a factor of 6 over the Cray C-90. 

Implementation of the Kirchhoff solver on the SP2 
is relatively straightforward due to the regular nature 
of the computations. The method is nearly as efficient 
on 1 processor of the SP2 as on 1 processor of the Cray 
C-90 and shows nearly perfect parallel speedups on the 
SP2. The unsteady far-field acoustic properties are cal- 
culated at 1886 observer locations, and visual as well 
as audio postprocessing is performed using the Stereo- 
phonic Acoustics Software Library at NASA Ames to 
gain better insight to the far-field noise characteristics. 
These calculations, which would take over 200 hours of 
CPU time on one processor of the Cray C-90, are per- 
formed on the SP2 in a few hours. Ref. [42] will contain 
a more complete description and more results from the 
parallel Kirchhoff method and the associated video and 
audio postprocessing. 

Finally, although the results presented here are for 
a relatively simple nonlifting two-blade configuration, 
the same CFD/Kirchhoff algorithms discussed in this 
paper are used on more advanced configurations such 
as the V-22 tiltrotor [43]. The use of parallel process- 
ing will allow larger and more complex problems to be 
solved in the future. Also, while the CFD/Kirchhoff 
approach is applied here to helicopter noise prediction, 
the parallelization strategies are not unique to this ap- 
plication and the approaches could readily be applied 
to other problems. 
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